Extraction de motifs séquentiels dans les flux de données. (Sequential patterns mining from data streams)
Abstract
In recent years, many applications dealing with data generated continuously and at high speeds have emerged. These data are now qualified as data streams. Dealing with potentially infinite quantities of data imposes constraints that raise many processing problems. Examples of such constraints are the inability to block the data stream and the need to produce results in real time. Nevertheless, many application areas (such as bank transactions, Web usage, network monitoring, etc.) have attracted a lot of interest in both industry and academia. These potentially infinite quantities of data prohibit any hope of complete storage; we need, however, to be able to examine the history of the data streams. This led to the compromise of "summaries" of the data stream and "approximate" results. Today, a huge number of different types of data stream summaries have been proposed. However, continuous developments in technology and in corresponding applications demand similar progress in summary and analysis methods. Moreover, sequential pattern extraction is still little studied: when this thesis began, there were no methods for extracting sequential patterns from data streams. Motivated by this context, we are interested in a method that summarizes the data stream in an efficient and reliable way and that has as its main purpose the extraction of sequential patterns. In this thesis, we propose the CLUSO (Clustering, Summarizing and Outlier detection) approach. CLUSO allows us to obtain clusters from a stream of sequences of itemsets, to compute and maintain histories of these clusters and to detect outliers. The contributions detailed in this report concern:
Clustering sequences of itemsets in data streams. To the best of our knowledge, it is the first work in this domain.
Summarizing data streams by way of sequential pattern extraction. Summaries given by CLUSO consist of aligned sequential patterns representing clusters associated to their history in the stream. The set of such patterns is a reliable summary of the stream at time t. Managing the history of these patterns is a crucial point in stream analysis. With CLUSO we introduce a new way of managing time granularity in order to optimize this history.
Outlier detection. This detection, when related to data streams, must be fast and reliable. More precisely, stream constraints forbid requiring parameters or adjustments from the end-user (ignored outliers or their late detection can be detrimental). Outlier detection in CLUSO is automated and self-adjusting.
We also present a case study on real data, written in collaboration with Orange Labs.
Similar resources
Extraction de motifs séquentiels dans les flots de données d'usage du Web
Abstract. In recent years, new constraints have emerged for data mining techniques. These constraints are typical of a new kind of data: data streams. In a mining process applied to a data stream, memory usage is limited, new elements are generated continuously and must be processed as quickly as possible, ...
Extraction De Motifs Séquentiels Dans Des Données Multidimensionelles. (Mining Sequential Patterns In Multidimensional Data)
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
SPAMS: Une nouvelle approche incrémentale pour l'extraction de motifs séquentiels fréquents dans les data streams
Abstract. Mining frequent sequential patterns from data streams is an important challenge addressed by the data mining research community. Even more than for databases, many additional constraints must be considered because of the intrinsic nature of streams. In this paper, we propose a new one-pass algorithm, SPAMS, based on the ...
Préservation de la vie privée. Recherche de motifs séquentiels dans des bases de données distribuées
Extracting knowledge without disclosing any individual or sensitive information is a new challenging problem for the data mining community. In this paper, we present a new algorithm PRIPSEP (privacy preserving sequential patterns) for the mining of sequential patterns from distributed databases while preserving privacy. We prove that our architecture and protocols employed by our algorithm are ...
Une approche centroïde pour la classification de séquences dans les data streams
In recent years, emerging applications introduced new constraints for data mining methods. These constraints are typical of a new kind of data: the data streams. In a data stream processing, memory usage is restricted, new elements are generated continuously and have to be considered as fast as possible, no blocking operator can be performed and the data can be examined only once. At this time ...
Extraction de motifs séquentiels. Problèmes et méthodes
SYNOPSIS. At first glance, the problem of sequential pattern mining may seem close to that of association rule mining. This similarity proves very fragile, however, because of a key element specific to sequential pattern mining: temporality. This notion makes it possible both to distinguish, within the records, an order of appearance...
Publication date: 2009